Robust Order Statistics based Ensembles for Distributed Data Mining

Authors

  • Kagan Tumer
  • Joydeep Ghosh
Abstract

Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated so far, each classifier is trained on data taken or resampled from a common data set, or from randomly selected partitions thereof, and thus experiences a similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics, for robust handling of such cases. Based on a mathematical modeling of how the decision boundaries are affected by order statistic combiners, we derive expressions for the reductions in error expected when such combiners are used. We show analytically that the selection of the median, the maximum and, in general, the ith order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and empirically show that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to several heterogeneous distributed data mining situations, especially when it is not practical or feasible to pool all the data in a common data warehouse before attempting to analyze it.

1 Mining of Distributed Data Sources

An implicit assumption in traditional statistical pattern recognition and machine learning algorithms is that the data to be used for model development is available as a single flat file. This assumption is valid for virtually all popular benchmark datasets such as those available from ELENA, Statlog or the UCI machine learning repository.
Such datasets are small or medium sized, requiring a few megabytes at most. Thus the algorithms typically also assume that the entire data can fit in main memory, and do not address computational issues regarding scalability and “out-of-core” operations.

The tremendous explosion in the amount of data gathering and warehousing in the past few years has generated very large and complex databases. Any effort to mine information from such databases has to address the facts that (i) data may be kept in several files, as in interlinked relational databases, and the information needed for decision making may be spread over more than one file; (ii) the files may be spread across several disks or even across different geographical locations; and (iii) the statistical quality of the data may vary widely. As an example of (i), the concept of “collective data mining” [Kargupta and Park, 2000] explicitly addresses “vertical partitioning” situations where the features or variables relevant to a classification decision are spread over multiple files, each accessible to only one classifier. As an example of (iii), the percentage of cases involving financial or health-care fraud varies across regions, and so does the amount of missing information.

One can argue that by transferring all data to a single warehouse and performing a series of merges and joins, one can obtain a single (albeit very large) flat file. A traditional algorithm can then be used after randomizing and subsampling this file. But in real applications this approach may not be feasible because of the computational, bandwidth and storage costs. In certain cases, it may not even be possible for a variety of practical reasons including security, privacy, the proprietary nature of data, the need for fault-tolerant distribution of data and services, real-time processing requirements, statutory constraints imposed by law, etc. [Prodromidis et al., 2000]. Then there are two options.
If the owners of the individual databases are willing to provide high-level or summary information/decisions, such as local classification estimates, and to transmit this information to a central location, then a meta-learner can be applied to the component decisions to come up with a final, composite decision. Note that such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records [DuMouchel et al., 1999]. Otherwise one has to resort to a distributed computing framework such as the emerging field of COllective INtelligence (COIN), wherein techniques are developed such that local and independent computations can still increase a desired global utility function [Wolpert and Tumer, 1999]. The first option leads to several issues reminiscent of studies in decision fusion [Dasarathy, 1994], applied largely to multi-sensor fusion and distributed control problems. It is also related to the theory of
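
The order statistic combiners named in the abstract can be illustrated with a minimal Python sketch of the first option, in which each site sends only its per-class estimates to a central combiner. The excerpt does not give the exact definitions of the trim and spread combiners, so the forms used here are assumptions: the trim combiner averages the ordered outputs after dropping the same number of values from each end, and the spread combiner averages the smallest and largest outputs. All function names and the example scores are hypothetical.

```python
from statistics import mean, median


def order_statistic(outputs, i):
    """Return the i-th smallest value (1-indexed) among the classifier outputs."""
    return sorted(outputs)[i - 1]


def trim_combiner(outputs, n_trim):
    """Assumed form: average the ordered outputs after dropping
    n_trim values from each end."""
    ordered = sorted(outputs)
    if 2 * n_trim >= len(ordered):
        raise ValueError("n_trim removes every output")
    kept = ordered[n_trim:len(ordered) - n_trim]
    return sum(kept) / len(kept)


def spread_combiner(outputs):
    """Assumed form: average of the smallest and largest outputs."""
    return 0.5 * (min(outputs) + max(outputs))


def combine(site_outputs, combiner):
    """Apply `combiner` to each class across sites, then pick the best class.

    site_outputs[s][c] is the estimate of site s for class c.
    Returns (predicted class index, list of combined scores).
    """
    n_classes = len(site_outputs[0])
    combined = [combiner([site[c] for site in site_outputs])
                for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: combined[c])
    return best, combined


# Hypothetical two-class scores from five sites; the last two sites are
# assumed to have been trained on poor-quality data.
site_outputs = [
    [0.60, 0.40],
    [0.65, 0.35],
    [0.60, 0.40],
    [0.00, 1.00],
    [0.05, 0.95],
]

mean_class, _ = combine(site_outputs, mean)      # plain averaging
median_class, _ = combine(site_outputs, median)  # median order statistic
max_class, _ = combine(site_outputs, max)        # maximum order statistic
```

With these made-up scores, plain averaging is pulled toward the two poorly performing sites and selects class 1, while the median selects class 0, the class favored by the three consistent sites. This only illustrates the mechanics; it is not evidence for the error reductions derived in the chapter.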


Similar resources

Cluster ensembles

Cluster ensembles combine multiple clusterings of a set of objects into a single consolidated clustering, often referred to as the consensus solution. Consensus clustering can be used to generate more robust and stable clustering results compared to a single clustering approach, perform distributed computing under privacy or sharing constraints, or reuse existing knowledge. This paper describes...


Robust learning from bites for data mining

Some methods from statistical machine learning and from robust statistics have two drawbacks. Firstly, they are computer-intensive such that they can hardly be used for massive data sets, say with millions of data points. Secondly, robust and non-parametric confidence intervals for the predictions according to the fitted models are often unknown. Here, we propose a simple but general method to ...


A High-Performance Model based on Ensembles for Twitter Sentiment Classification

Background and Objectives: Twitter Sentiment Classification is one of the most popular fields in information retrieval and text mining. Millions of people around the world use social networks like Twitter intensively. It lets users publish tweets to say what they think about topics. There are numerous web sites built on the Internet presenting Twitter. The user can enter a sentiment ta...


A comprehensive benchmark between two filter-based multiple-point simulation algorithms

Computer graphics offer various tools to enhance the reconstruction of high-order statistics that are not correctly addressed by two-point statistics approaches. Almost all the newly developed multiple-point geostatistics (MPS) algorithms, to some extent, adapt these techniques to increase simulation accuracy and efficiency. In this work, a careful comparison between our recently dev...


Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...



Publication date: 2000